We live in the Internet age: every day we are flooded with LINE, Email, Facebook, Instagram and all sorts of other messages and web pages. Even a whole day of reading would not be enough to digest them, so we leave messages on "read", mute whole groups, or simply delete things, and in the process we may miss something important. Faced with this "Information Overload", it would be nice to have a program that acts like a little secretary: it reads everything first and condenses it into summaries, so we only need to skim the summaries and follow a hyperlink into the full text when something catches our interest. That way we are much less likely to miss important information. So this time the topic is Automatic Text Summarization. Since the literal Chinese translation is hard to parse, I took the liberty of titling it 『自動擷取摘要』; to avoid debate, I will simply use the English term below.
I remember a news story from a few years ago about a 15-year-old in the UK who wrote a program that automatically summarized the daily news from the major papers, and it attracted quite a few subscribers. Let's see how this can be done, in particular with a Neural Network approach.
『A Gentle Introduction to Text Summarization』 lists many uses of Automatic Text Summarization:
At the same time, Automatic Text Summarization also helps the development of Question-Answering systems: only by grasping the gist of a question can a system give an appropriate answer. In general, there are two approaches:
1. The Extractive Method (萃取法): pick the most important sentences out of the source text and use them as the summary.
2. The Abstractive Method (抽象法): understand the text and regenerate a summary in new wording.
The former is simpler and easier to get working; the latter is harder, but closer to the way humans think. Since Neural Networks aim to mimic humans, most neural approaches take the latter route. Facebook, IBM and Google have all published papers on it, so with the three giants on board, this topic clearly counts as a hot one.
The most common approach today is similar to the one in [Day 18: 機器翻譯 (Machine Translation)](https://ithelp.ithome.com.tw/articles/10194403): an RNN seq2seq algorithm in Encoder-Decoder form, where the input (the question, or the article to be summarized) is fed to the encoder and the output (the answer, or the summary) is predicted by the decoder, as sketched below.
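To make the Encoder-Decoder idea concrete, here is a minimal, illustrative Keras sketch of a seq2seq training graph. The vocabulary size, hidden-state size and one-hot input representation are placeholder assumptions of mine; this is not the architecture from the papers mentioned above, only the basic encoder-state-to-decoder wiring.

from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.models import Model

vocab_size = 5000   # assumed vocabulary size (tokens fed as one-hot vectors)
latent_dim = 128    # assumed size of the RNN hidden state

# Encoder: read the source sequence and keep only its final hidden/cell states
encoder_inputs = Input(shape=(None, vocab_size))
_, state_h, state_c = LSTM(latent_dim, return_state=True)(encoder_inputs)

# Decoder: generate the target sequence, initialised with the encoder states
decoder_inputs = Input(shape=(None, vocab_size))
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=[state_h, state_c])
decoder_outputs = Dense(vocab_size, activation='softmax')(decoder_outputs)

# Training graph: source sequence + shifted target sequence in,
# next-token probabilities out
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
model.summary()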
Let's first look at the steps of the Extractive Method; for details, see 『Text Summarization Techniques: A Brief Survey』. Broadly, an extractive system builds an intermediate representation of the text, scores each sentence against it, and then selects the highest-scoring sentences as the summary.
Let's look at a sample program. It comes from Chapter 5 of the book 『NLTK Essentials』; it uses the NLTK library and does not involve Neural Networks. Some of the techniques involved were covered in the previous post.
import nltk
# The article: a short news piece about Barack Obama stepping down as president
news_content='''At noon on Friday, 55-year old Barack Obama became a federal retiree.
His pension payment will be $207,800 for the upcoming year, about half of his presidential salary.
Obama and every other former president also get seven months of "transition" services to help adjust to post-presidential life. The ex-Commander in Chief also gets lifetime Secret Service protection as well as allowances for things such as travel, office expenses, communications and health care coverage.
All those extra expenses can really add up. In 2015 they ranged from a bit over $200,000 for Jimmy Carter to $800,000 for George W. Bush, according to a government report. Carter doesn't get health insurance because you have to work for the federal government for five years to qualify.
'''
# Tokenize, POS-tag, run NER, score each sentence; the sentences are ranked by score later
results=[]
for sent_no,sentence in enumerate(nltk.sent_tokenize(news_content)):
    no_of_tokens=len(nltk.word_tokenize(sentence))
    # Let's do POS tagging
    tagged=nltk.pos_tag(nltk.word_tokenize(sentence))
    # Count the no of Nouns in the sentence
    no_of_nouns=len([word for word,pos in tagged if pos in ["NN","NNP"] ])
    # Use NER to tag the named entities.
    ners=nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)), binary=False)
    no_of_ners= len([chunk for chunk in ners if hasattr(chunk, 'label')])
    score=(no_of_ners+no_of_nouns)/float(no_of_tokens)
    results.append((sent_no,no_of_tokens,no_of_ners, no_of_nouns,score,sentence))
# Print the sentences in order of importance (highest score first)
for sent in sorted(results,key=lambda x: x[4],reverse=True):
    print(sent[5])
The program can be downloaded from 『Packt』; the file is summarizer.py. The original had a few typos, so I fixed them and added a few comments; my version can be downloaded here.
Run the following command in a DOS (command-prompt) window:
python summarizer.py
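One practical note, not part of the original book example: the script depends on a few NLTK data packages for tokenization, POS tagging and NER. If they are missing, NLTK raises a LookupError; they can be installed once like this:

import nltk
nltk.download('punkt')                        # sentence / word tokenizer models
nltk.download('averaged_perceptron_tagger')   # POS tagger
nltk.download('maxent_ne_chunker')            # NER chunker
nltk.download('words')                        # word list used by the NER chunker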
The program is short, roughly ten lines. The processing flow is as follows:
1. Split the article into sentences (nltk.sent_tokenize) and each sentence into words (nltk.word_tokenize).
2. POS-tag each sentence and count the nouns (NN, NNP); run NER (nltk.ne_chunk) and count the named entities.
3. Score each sentence as (named entities + nouns) / number of tokens.
4. Sort the sentences by score, from highest to lowest, and print them.
The sentences, output in order of importance, are:
At noon on Friday, 55-year old Barack Obama became a federal retiree.
The ex-Commander in Chief also gets lifetime Secret Service protection as well as allowances for things such as travel, office expenses, communications and health care coverage.
In 2015 they ranged from a bit over $200,000 for Jimmy Carter to $800,000 for George W. Bush, according to a government report.
His pension payment will be $207,800 for the upcoming year, about half of his presidential salary.
Carter doesn't get health insurance because you have to work for the federal government for five years to qualify.
Obama and every other former president also get seven months of "transition" services to help adjust to post-presidential life.
All those extra expenses can really add up.
The lead sentence is ranked first, and at first glance the ordering looks fairly reasonable: the sentences that mention dollar amounts all land near the top (I guess I just like money XD). The simplest way to produce an automatic summary is to take the first few ranked sentences, which is exactly a simplified Extractive Method; a small sketch of that last step follows below.
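As a minimal sketch of that step, reusing the results list built by the script above (the value of top_n is an arbitrary choice of mine):

# Keep the top-scoring sentences, then restore their original order
# so the summary still reads in the same sequence as the source article
top_n = 3
top_sentences = sorted(results, key=lambda x: x[4], reverse=True)[:top_n]
summary = ' '.join(s[5] for s in sorted(top_sentences, key=lambda x: x[0]))
print(summary)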
Next time, we will look at the Abstractive Method, i.e., the Neural Network approach.